
Conversation

Collaborator

@brian-dellabetta brian-dellabetta commented Oct 20, 2025

SUMMARY:
Upgrade the lm_eval vision language tests from Qwen 2.5 to Qwen 3. After updating the configs to include apply_chat_template, the scores closely align with what was achieved with Qwen 2.5.

  • switch to the neuralmagic/calibration dataset, based on a suggestion here, to avoid tracing issues related to the VL dataset (a rough sketch of this step follows the list).
  • switch to the chartqa task, to increase the number of samples to 500 and reduce variance in accuracy.
  • prune unused datasets (slimorca and llm_compression_calibration)
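
For context, a rough sketch of what the calibration step looks like with the text-only dataset. This is illustrative only, not the test harness: the model path, recipe name, dataset subset/split, and sample count below are placeholders.

```python
# Rough sketch (placeholders throughout): oneshot calibration against the
# text-only neuralmagic/calibration dataset, which sidesteps tracing a
# vision-language data pipeline.
from datasets import load_dataset
from llmcompressor import oneshot  # older releases expose this via llmcompressor.transformers

# The subset/split arguments are assumptions; the real configs define their own.
calibration_ds = load_dataset("neuralmagic/calibration", split="train")

oneshot(
    model="<qwen3-vl-model-or-local-path>",   # model under test (placeholder)
    dataset=calibration_ds,
    recipe="<quantization-recipe.yaml>",      # e.g. FP8-dynamic, INT8 W8A8, or W4A16
    max_seq_length=2048,
    num_calibration_samples=512,
)
```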

TEST PLAN:
The 3 lm_eval VL tests were run and the accuracies were updated; an illustrative invocation is sketched after the list below.

  • vl_fp8_dynamic_per_token.yaml runs in ~29m
  • vl_int8_w8a8_dynamic_per_token.yaml runs in ~37m
  • vl_w4a16_actorder_weight.yaml runs in ~34m
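
For reference, one of these evaluations could be reproduced through lm_eval's Python API along the lines below. The backend name, model path, and batch size are assumptions, not values taken from the test configs.

```python
# Illustrative only: evaluate a compressed Qwen3 VL checkpoint on chartqa with
# the chat template applied. Backend, model path, and batch size are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",                        # assumed multimodal backend
    model_args="pretrained=<path-to-compressed-model>",
    tasks=["chartqa"],
    limit=500,                                    # number of eval samples
    batch_size=8,
    apply_chat_template=True,                     # scores align with Qwen 2.5 only with this set
)
print(results["results"]["chartqa"])
```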

@github-actions

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite; please only add the label once the PR is code complete and local testing has been performed.

Collaborator

@dsikka dsikka left a comment


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

@brian-dellabetta
Collaborator Author

Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would add probably ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours (~40 min × 2 runs × 3 configs) with that change

Collaborator

dsikka commented Oct 20, 2025


Why not just use mmmu_val instead of the literature task? This gives us around 0.53 for the dense model?

mmmu_val is 900 evals total instead of 30. That would add probably ~40 minutes to each lm-eval run, and we run two for each config, so total test time would increase by over 3 hours (~40 min × 2 runs × 3 configs) with that change

The 30 datapoints have proven to be very noisy historically. A happy medium might be better, but we should also just validate the runtime for a batch size of 100
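
Something along these lines (placeholders throughout) could be used to sanity-check the runtime at the proposed sample count and batch size before committing to it:

```python
# Quick, illustrative runtime check: time a single chartqa eval at the proposed
# settings. Backend and model path are placeholders.
import time

import lm_eval

start = time.time()
lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=<path-to-compressed-model>",
    tasks=["chartqa"],
    limit=500,          # proposed sample count
    batch_size=100,     # proposed batch size to validate
    apply_chat_template=True,
)
print(f"elapsed: {(time.time() - start) / 60:.1f} min")
```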

@brian-dellabetta brian-dellabetta force-pushed the bdellabe/qwen3-vl-lmeval branch from ea00c16 to 57e50b1 on October 21, 2025
@brian-dellabetta brian-dellabetta marked this pull request as ready for review October 22, 2025 20:19
kylesayrs previously approved these changes Oct 24, 2025
Collaborator

@kylesayrs kylesayrs left a comment


Woop

Collaborator

@dsikka dsikka left a comment


LGTM. 3 questions:

  1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?
  2. Is 100 enough?
  3. Did we mention to MLR the variation we see without the chat_template?

@brian-dellabetta
Collaborator Author

LGTM. 3 questions:

1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?

2. Is 100 enough?

3. Did we mention to MLR the variation we see without the chat_template?

@dsikka thanks, see responses below:

  1. We could prune the code in testing_utils. Should I get rid of gsm8k, open-platypus and slim-orca?
  2. I can up this as well. I set it to 100 because tests were taking forever, but I think the CPU of the cluster was just under heavy load when I was trying. 500?
  3. I can do so Monday.

rahul-tuli previously approved these changes Oct 27, 2025
Collaborator

@rahul-tuli rahul-tuli left a comment


LGTM!

Collaborator

dsikka commented Oct 28, 2025

LGTM. 3 questions:

1. Do we need to keep all the datasets in testing_utils? Are there some that we can remove?

2. Is 100 enough?

3. Did we mention to MLR the variation we see without the chat_template?

@dsikka thanks, see responses below:

  1. We could prune the code in testing_utils. Should I get rid of gsm8k, open-platypus and slim-orca?
  2. I can up this as well. I set it to 100 because tests were taking forever, but I think the CPU of the cluster was just under heavy load when I was trying. 500?
  3. I can do so Monday.

Yeah, I think we should up it to 500 and remove any testing dataset that we're not using.

@brian-dellabetta brian-dellabetta added the ready label (When a PR is ready for review) on Oct 28, 2025